Hummer: Mitigating Stragglers with Partial Clones
Authors
Abstract
Small jobs, typically run for interactive data analyses in datacenters, are often delayed by long-running tasks called stragglers. Many techniques, such as blacklisting, speculative execution, and proactive mitigation, have been devoted to this problem, but they either consume too much time or waste too many resources. In this paper, we propose a new proactive method that mitigates stragglers by performing partial clones, improving average job duration by 48% and 18% compared to LATE and Dolly, respectively.
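As a rough illustration of the partial-cloning idea (a minimal simulation, not Hummer's actual scheduler), the sketch below gives an extra copy to only a fraction of a job's tasks and lets each task finish when its fastest copy finishes; the runtimes, clone_fraction, and straggler probability are made-up assumptions.

```python
# Minimal simulation of proactive partial cloning (illustrative only, not
# Hummer's scheduler). Only a fraction of tasks get a second copy; a task
# finishes when its fastest copy finishes, so a straggling copy of a cloned
# task no longer delays the job.
import random

NORMAL, STRAGGLER = 0.5, 5.0        # assumed task runtimes (seconds)

def copy_runtime(straggler_prob=0.2):
    return STRAGGLER if random.random() < straggler_prob else NORMAL

def job_duration(num_tasks=10, clone_fraction=0.3):
    finish = []
    for _ in range(num_tasks):
        copies = 2 if random.random() < clone_fraction else 1   # partial cloning
        finish.append(min(copy_runtime() for _ in range(copies)))
    return max(finish)               # the job waits for its slowest task

if __name__ == "__main__":
    random.seed(0)
    trials = 1000
    for frac in (0.0, 0.3, 1.0):     # no cloning, partial cloning, full cloning
        avg = sum(job_duration(clone_fraction=frac) for _ in range(trials)) / trials
        print(f"clone_fraction={frac}: average job duration ~ {avg:.2f}s")
```

Full cloning removes the most stragglers but doubles resource use; partial cloning trades some of that protection for a much smaller footprint, which is the trade-off behind the numbers in the abstract.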
Similar resources
Gradient Coding: Avoiding Stragglers in Distributed Learning
We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent. We implement our schemes in Python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and generalization error.
Gradient Coding
We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent. We implement our scheme in MPI and show how we compare against baseline architectures in running time and generalization error.
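A concrete, self-contained instance of this idea (a textbook-style example, not the authors' code) is the classic 3-worker scheme: each worker holds two of three data partitions and sends one coded combination of its partial gradients, and the master can recover the full gradient g1 + g2 + g3 from any two workers, so a single straggler can simply be ignored. The gradient vectors below are made up.

```python
# Gradient coding with 3 workers and 3 partitions: any 2 of the 3 coded
# messages suffice to recover the full gradient, tolerating one straggler.
import numpy as np

g1, g2, g3 = np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([0.5, 4.0])

# Coded messages sent by each worker (each uses only its two local partitions).
w1 = 0.5 * g1 + g2        # worker 1 holds {g1, g2}
w2 = g2 - g3              # worker 2 holds {g2, g3}
w3 = 0.5 * g1 + g3        # worker 3 holds {g1, g3}

# Decoding coefficients the master applies, depending on which worker straggles.
decode = {
    frozenset({1, 2}): (2.0, -1.0),   # 2*w1 - w2
    frozenset({1, 3}): (1.0, 1.0),    # w1 + w3
    frozenset({2, 3}): (1.0, 2.0),    # w2 + 2*w3
}

full = g1 + g2 + g3
msgs = {1: w1, 2: w2, 3: w3}
for survivors, (a, b) in decode.items():
    i, j = sorted(survivors)
    recovered = a * msgs[i] + b * msgs[j]
    assert np.allclose(recovered, full)
print("full gradient recovered from any 2 of 3 workers:", full)
```

The price of tolerating any one straggler here is that every partition is stored and processed twice, the replication/coding cost the abstract alludes to.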
Fine-Grained Micro-Tasks for MapReduce Skew-Handling
Recent work on MapReduce has considered the problems of skew, where a job’s tasks exhibit large variance in size and processing cost, and stragglers, tasks that run slowly due to conditions on particular nodes. In this paper, we discuss an extremely simple approach to mitigating skew and stragglers: break the workload into many small tasks that are dynamically scheduled at runtime. This approach ...
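A bare-bones sketch of that micro-task idea follows (illustrative, not the paper's system): the input is split into many small chunks that idle workers pull from a shared queue, so an expensive chunk or a slow worker simply ends up handling fewer chunks instead of holding up one large partition. The chunk size and worker count are arbitrary.

```python
# Illustrative micro-task scheduling: many small tasks pulled dynamically
# from a shared queue by a few workers, so per-task skew and slow workers
# balance out at runtime.
import queue
import threading

def worker(tasks, results, lock):
    while True:
        try:
            chunk = tasks.get_nowait()      # dynamically pull the next micro-task
        except queue.Empty:
            return
        partial = sum(chunk)                # stand-in for the real per-record work
        with lock:
            results.append(partial)

def run(records, micro_task_size=100, num_workers=4):
    tasks = queue.Queue()
    for i in range(0, len(records), micro_task_size):
        tasks.put(records[i:i + micro_task_size])
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

print(run(list(range(10_000))))   # 49995000
```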
Effective Straggler Mitigation: Attack of the Clones
Small jobs, typically run for interactive data analyses in datacenters, continue to be plagued by disproportionately long-running tasks called stragglers. In production clusters at Facebook and Microsoft Bing, even after applying state-of-the-art straggler mitigation techniques, these latency-sensitive jobs have stragglers that are on average 8 times slower than the median task in ...
Near-Optimal Straggler Mitigation for Distributed Gradient Methods
Modern learning algorithms use gradient descent updates to train inferential models that best explain data. Scaling these approaches to massive data sizes requires proper distributed gradient descent schemes where distributed worker nodes compute partial gradients based on their partial and local data sets, and send the results to a master node where all the computations are aggregated into a f...
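The worker/master pattern that abstract describes can be sketched in a few lines (a simplified, assumed setup on a synthetic least-squares problem, with no straggler handling): each worker computes a partial gradient on its local shard, and the master sums the partials into the full gradient before taking a descent step.

```python
# Synchronous distributed gradient descent, simulated in one process:
# workers compute partial gradients on local shards; the master aggregates.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1200, 5)), rng.normal(size=1200)
shards = np.array_split(np.arange(len(y)), 4)          # 4 workers' local data sets

def partial_gradient(w, idx):
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi)                         # unnormalized partial gradient

w = np.zeros(5)
for step in range(200):
    grads = [partial_gradient(w, idx) for idx in shards]   # computed by the workers
    full_grad = sum(grads) / len(y)                        # aggregated at the master
    w -= 0.1 * full_grad
print("learned weights:", w)
```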